ABSTRACT: Periocular biometrics has been established as an independent modality because of concerns 
about the performance of iris and face systems in uncontrolled conditions. This paper presents an 
evaluation and comparison of periocular recognition techniques, namely ReLU non-linearity, DeepIrisNet, 
FaceNet, Light CNN, Multimodal CNN, Deep CNN and RGB-OCLBCP, applied to images 
from the UBIPr, CASIA-Iris and AR datasets. Besides the feature extraction protocol, we also present a 
comparison of several classifiers and a performance evaluation of the feature extraction 
techniques. Performance is validated using the rank-1 and rank-5 recognition rates, EER and ROC. 
From this evaluation we found that RGB-OCLBCP outperforms the other techniques. 
1. INTRODUCTION 


The human periocular region, which includes 
the eye, eyelids, eyelashes, eyebrows, and skin 
texture near the eye, is an emerging biometric 
trait for human identification. Interestingly, it is 
possible to capture periocular images from a 
distance without imposing more constraints on 
participants than in face biometrics, and while 
achieving excellent recognition accuracy similar 
to iris biometrics. In recent years, however, few 
studies have examined the periocular region 
without including the iris. Several applications 
of deep learning to periocular recognition have 
shown interesting results. While deep models have 
many potential applications, one of the major 
challenges is that they require a large amount of 
training data, which is not readily available. 
Incorporating human-expert knowledge in the 
design process and being able to apply the 
feature representation to small datasets are two 
advantages of hand-designed feature 
representations. A number of different 
classification techniques are available, such as 
k-Nearest Neighbour (k-NN), Support Vector 
Machine (SVM), Artificial Neural Network 
(ANN) and Gaussian Mixture Model (GMM) [1]. 
Among these techniques SVM is one of the most 
popular classification techniques widely used in 
pattern recognition applications such as image 
classification [2], remote sensing [3], biological 
and other sciences [4]. 


In this study we focus on evaluating deep 
neural network techniques, namely ReLU 
non-linearity, DeepIrisNet, FaceNet, Light CNN, 
Multimodal CNN, Deep CNN and RGB-OCLBCP. 
An SVM finds a classification hyperplane that 
maximizes its distance from the nearest data 
point on each side. Future research should 
compare the performance of these techniques 
exhaustively. Additionally, the community 
should investigate the most effective way to 
design deep networks [5]. This study therefore 
evaluates various techniques adopted by 
researchers. The remainder of this paper 
covers a background study of the periocular 
region in Section 2, state-of-the-art deep neural 
network algorithms in Section 3 and performance 
evaluation on recognition in Section 4; the 
results are discussed in Section 5 
and Section 6 summarizes the overall paper. 
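

The maximum-margin idea mentioned above can be illustrated with a minimal sketch. The function below only measures the geometric margin of a given linear boundary w.x + b = 0 over a labelled set; an SVM would search for the w and b that make this quantity as large as possible. The toy points and the names geometric_margin, w and b are illustrative assumptions and are not taken from the evaluated systems.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any labelled point to the hyperplane w.x + b = 0.

    A positive value means every point lies on the side given by its label (+1 or -1).
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Toy, hypothetical 2-D data: two points per class.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

# Two candidate separating boundaries; both classify the toy data correctly.
margin_a = geometric_margin((1.0, 1.0), -2.0, points, labels)
margin_b = geometric_margin((1.0, 0.0), -1.0, points, labels)
```

Both candidates separate the toy data, but the first leaves a wider gap to the nearest points, so a maximum-margin classifier would prefer it.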


2. BACKGROUND STUDY 


The field of biometrics has been actively 
researching periocular recognition since 2009. 
A variety of methods for extracting identity 
from the periocular region have been explored. 
In comparison to the full face, a periocular scan, 
which covers one sub-region of the face, 
shows less distortion and occlusion, resulting in 
greater stability and accuracy. 


The periocular region is first imaged with a 
visible-light or near-infrared camera. The use of 
this periocular image has been validated by Park 
et al. [6,7]. The periocular area is often a 
rectangular area defined by the center of the eye 
or by the inner and outer eye corners. Feature 
extraction is the next step. One of the most 
difficult problems in periocular recognition is 
choosing features that represent reliable and 
unique characteristics of the periocular region. 
Test features are then compared to training 
features to find a match. Various classification 
approaches are used to classify test data [8]. 


The conclusions of such studies help 
determine which classification strategies 
should be used. Miller et al. [9,10] examined 
how blurring, resolution, and lighting affect the 
robustness of appearance-based periocular 
recognition. The largest performance decrease 
was observed with high-resolution images. Xu 
et al. [11] evaluated a series of features to find 
the best feature extraction method that could be 
used in standalone mode without needing 
generic or discriminant subspace training. The 
local binary Walsh transform model performed 
better than the others, especially when used 
with kernel correlation feature analysis (KCFA). 


Deep learning has recently been used for 
periocular recognition. The main advantage of 
deep learning algorithms is their ability to learn 
feature representations automatically and 
directly from training data, rather than relying 
on human assumptions as in existing handcrafted 
features. Convolutional neural networks (CNNs), 
among the most prevalent types of neural 
networks, were used by Ahuja et al. [12]. Proenca 
et al. attempted to improve the feature 
learning process of CNNs by implicitly 
identifying the areas of interest in the input data 
that should be prioritized, rather than masking 
out any areas in the test/training samples [13]. 
For supervised learning, they used a four-layer 
stacked convolutional network that produces a 
512-dimensional feature vector, with cosine 
similarity used for testing [14]. Zhao et al. 
proposed to focus explicitly on the critical 
regions and assign them higher weights so that 
these important regions have a greater impact on 
the recognition process [15]. In a more recent 
proposal, Zhao et al. likewise assign higher 
weights to critical areas of the periocular region 
to give them a greater impact on the 
recognition process [16]. 
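

As an illustration of the cosine-similarity matching step mentioned above, the following minimal sketch compares two feature vectors. The function names and the decision threshold of 0.5 are assumptions made for illustration; the cited works do not prescribe them here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def is_match(probe, gallery, threshold=0.5):
    """Hypothetical decision rule: accept when the similarity reaches the threshold."""
    return cosine_similarity(probe, gallery) >= threshold
```

In practice the vectors would be the 512-dimensional network outputs, and the threshold would be tuned on validation data.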


It should be noted that periocular recognition 
based on CNNs has also been considered, with 
promising results [17]. Sharma et al. proposed a 
neural-network-based approach for learning the 
variations produced by different spectra [18]. 
Neural networks were first trained separately 
for each spectrum and were then 
combined to learn cross-spectral variability. 
Despite this body of work, few techniques deal 
with scale, translation, rotation, and 
illumination changes in periocular images. 
In these less-than-ideal settings, 
periocular biometrics will prove even more 
useful and be applied in a wide variety of 
applications [19]. 


3. Deep Neural Network Algorithms 
3.1 Method 1: ReLU non-linearity 


The algorithm proposed by Krizhevsky et 
al. [20] trains a large, deep neural network 
on the high-resolution ImageNet LSVRC-2010 
image dataset. For efficient training, 
non-saturating neurons and a GPU 
implementation were used. To reduce 
overfitting in the fully-connected layers, 
the “dropout” regularization method 
is used [20]. 


The network's operating principle is 
described in detail in that publication. The 
network contains eight layers with weights: 
five convolutional layers followed by three 
fully-connected layers. The network is trained 
to maximize the multinomial logistic regression 
objective over the predicted distribution [21]. 
The kernels of the second, fourth, and fifth 
convolutional layers are connected only to those 
kernel maps of the preceding layer that reside on 
the same GPU, whereas the kernels of the third 
convolutional layer are connected to all kernel 
maps of the second layer. All of the 
neurons in the fully-connected layers are 
connected to the neurons in the layer before 
them. The ReLU non-linearity is applied to the 
output of every convolutional and 
fully-connected layer. 
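

The two ingredients named in this subsection, the ReLU non-linearity and dropout, can be sketched in a few lines. This is a minimal illustration on plain Python lists, not the implementation of [20]. The dropout shown is the commonly used "inverted" variant, which rescales the kept activations during training; that choice, the default rate of 0.5 and the function names are assumptions.

```python
import random

def relu(values):
    """Rectified linear unit: negative activations are set to zero."""
    return [max(0.0, v) for v in values]

def dropout(values, rate=0.5, training=True, rng=None):
    """Inverted dropout (assumed variant).

    During training each activation is zeroed with probability `rate` and the
    survivors are scaled by 1 / (1 - rate); at test time the input is returned
    unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(values)
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]
```

ReLU does not saturate for positive inputs, which is why such units are described as non-saturating; dropout discourages co-adaptation of neurons in the fully-connected layers.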


The first convolutional layer filters the 
input image, and the second convolutional layer 
filters the output of the first using 256 kernels. 
The outputs of the second convolutional layer 
are connected to the third convolutional layer. 
There are no pooling or normalization layers 
between the third, fourth, and fifth 
convolutional layers. The 
fourth convolutional layer has 384 kernels, 
while the fifth convolutional layer has 256 
kernels. Each of the fully-connected layers 
contains 4096 neurons. 
As a result, a deep convolutional neural 
network trained with supervised learning can 
perform better. If one of the 
convolutional layers is removed, the network's 
performance deteriorates: it loses 
around 2% to 8% of its performance. 


3.2 Method 2: DeepIrisNet 


The DeepIrisNet method uses deep CNNs. 
Its experimental analysis 
revealed that it can handle the microstructures 
of the iris very efficiently and provides 
discriminative and robust representations in 
terms of accuracy when implemented in our 
comparative study. Experimenting with this 
technique also gave a considerable increase in 
cross-sensor recognition accuracy [22]. 
DeepIrisNet is frequently employed because it 
can handle large-scale iris data with complex 
distributions. The authors meticulously 
designed the architecture for optimal iris 
representation and effective CPU resource 
utilization. Dropout 
learning [23], small filter size [24], very deep 
architecture [25,26], rectified linear 
non-linearity (ReLU) [27], batch normalization [28], 
and other popular CNN components are all 
integrated into the architecture of this model. 
The authors employed four distinct well-known 
segmentation methods to test the resilience of 
this technique: CAHT [29], WAHET [30], Osiris 
[31], and IFFP [32]. 


The authors divide their architecture into 
two models, DeepIrisNet-A and DeepIrisNet-B. 
The former is based on standard 
convolutional layers, whereas the latter is based 
on stacking inception layers. Producing 
an efficient and effective iris representation 
usually requires a well-constructed CNN 
architecture. DeepIrisNet-A has eight 
convolutional layers, with two degrees of 
pooling for a total of four pooling layers. In the 
DeepIrisNet-B network, five convolutional 
layers (conv1 to conv5) are stacked 
on top of each other, 
followed by two inception layers (6 
and 7). Pooling takes place in the same way 
as in DeepIrisNet-A. 


Initially, the weights of the CNNs are drawn 
from a zero-mean Gaussian distribution with a 
fixed standard deviation. Like Method 1 in 
Section 3.1, this technique uses the 
Rectified Linear Unit (ReLU) activation 
function in all hidden layers. Without any 
preprocessing, we fed the UBIPr, CASIA-Iris and 
AR datasets into DeepIrisNet 
for feature extraction. According to our 
experimental analysis of Method 2, 
DeepIrisNet-A is clearly more robust than the 
baseline when the segmentation strategy changes. 
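

The weight initialization described above can be sketched as follows. The standard deviation of 0.01, the layer shape and the function name are assumptions for illustration only, since the value used by the original authors is not stated here.

```python
import random

def init_weights(n_out, n_in, std=0.01, seed=None):
    """Draw an n_out x n_in weight matrix from a zero-mean Gaussian.

    `std` is a hypothetical default; `seed` makes the draw reproducible.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

# Example: weights for a hypothetical layer with 128 inputs and 64 outputs.
weights = init_weights(64, 128, std=0.01, seed=0)
```

Small random weights break the symmetry between units, while the zero mean keeps initial activations centered.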


3.3 Method 3: FaceNet 


The FaceNet framework [33] is a unified 
system for face clustering, recognition, and 
verification. The method learns a Euclidean 
embedding using Deep Convolutional Networks 
(DCN): FaceNet directly maps face images to a 
Euclidean space in which distances correspond 
to face similarity. The DCN is trained to 
optimize the embedding itself. A triplet model 
is used to train the DCN on matching 
(identical) and non-matching (non-identical) 
faces [34]. 
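

The triplet objective can be written compactly. The sketch below computes the hinge-style triplet loss on squared Euclidean distances for a single (anchor, positive, negative) triplet; the margin value and function names are illustrative assumptions, and triplet selection, which matters greatly in practice, is not shown.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Loss for one triplet.

    It is zero once the anchor is closer to the positive (same identity) than
    to the negative (different identity) by at least `margin`.
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)
```

Minimizing this loss over many triplets pulls embeddings of the same identity together and pushes those of different identities apart.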


FaceNet achieved 99.63 percent accuracy on 
the Labeled Faces in the Wild (LFW) dataset 
and 95.12 percent 
accuracy on the YouTube Faces DB dataset. We 
used the same approach on the UBIPr, 
CASIA-Iris and AR datasets, and the resulting 
accuracy is shown in Table 1, Table 2, and 
Table 3. The main architectures used in this 
approach are the 
Zeiler & Fergus [35] style networks and the 
more recent Inception-type networks. In 
all of these models, rectified linear units are 
used as the non-linear activation function. The 
input size can range from 96x96 
pixels to 224x224 pixels. The CNN was trained 
using Stochastic Gradient 
Descent (SGD) with the standard backpropagation 
algorithm [36,37]. The 
Inception-based architecture performs 
better. Applying this 
approach to the datasets investigated in 
this paper, we found that it is most 
effective with low-resolution images; with 
high-resolution datasets it performs slightly 
worse. 


3.4 Method 4: Light CNN 


In Method 4, the Light CNN architecture was 
proposed for learning a compact embedding 
from large-scale data with noisy labels. 
To introduce a variation of maxout activation 
into the CNN, the authors devised the 
Max-Feature-Map (MFM) operation. MFM not only 
separates noisy and informative 
signals but also selects features between 
two feature maps. A semantic 
bootstrapping method is used to make the 
network's predictions more consistent with the 
noisy labels. A single model with a 256-D 
representation extracts face features faster than 
other models while using only a few parameters. 
The model is not fine-tuned or re-trained for 
the target data; instead, it extracts 
features and compares them using cosine 
similarity. 
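

The Max-Feature-Map operation can be sketched as an element-wise maximum between the two halves of a layer's output channels. The nested-list layout (channels, rows, columns) and the function name are assumptions for illustration; a real implementation would operate on tensors.

```python
def max_feature_map(channels):
    """MFM over 2N feature maps: output map k is the element-wise maximum of
    input maps k and k + N, so the number of channels is halved."""
    if len(channels) % 2 != 0:
        raise ValueError("MFM needs an even number of feature maps")
    half = len(channels) // 2
    return [
        [
            [max(a, b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(channels[k], channels[k + half])
        ]
        for k in range(half)
    ]

# Four hypothetical 2x2 feature maps are reduced to two.
maps = [
    [[1, 5], [3, 0]],
    [[2, 2], [2, 2]],
    [[4, 1], [0, 9]],
    [[0, 3], [5, 1]],
]
reduced = max_feature_map(maps)
```

Because each output keeps only the stronger of two competing responses, the operation acts as a built-in feature selector while halving the channel count.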


Applying this technique to the UBIPr, 
CASIA-Iris and AR datasets 
used in our research, we achieved 
superior results while lowering the overall 
number of parameters. The approach also works 
with enormous amounts of noisy data to train a 
Light CNN model that is efficient in both 
computation and storage [38]. 


3.5 Method 5: Multimodal CNN 


A deep multimodal fusion network is used to 
fuse several modalities for person 
verification [39]. The framework consists of 
multiple streams of modality-specific CNNs. 
Several distinct convolutional layers are used 
to extract multiple features, and the features 
retrieved from these layers 
represent the input at various levels 
of abstraction. To reduce the number 
of network parameters without affecting 
accuracy, the authors used compressed 
feature representations of all modalities, 
integrated at the fully-connected layers. 
Modality-specific layers are used to represent 
features for later fusion rather than performing 
spatial fusion at the convolutional layers. In this 
design, all of the CNNs and the classifier are 
optimized jointly. The authors also investigated 
combining multi-level abstract features for 
biometric person identification. 
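

Fusion at the fully-connected level can be illustrated with a minimal sketch in which each modality contributes a compact embedding and the embeddings are joined into one vector before a linear classifier. Concatenation as the joining operation, the toy weights and the function names are assumptions; the cited framework may combine the representations differently.

```python
def fuse_embeddings(embeddings):
    """Concatenate per-modality embedding vectors into one joint feature vector."""
    fused = []
    for vec in embeddings:
        fused.extend(vec)
    return fused

def linear_scores(features, weights, biases):
    """One score per class from a fully-connected (linear) layer."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]

# Hypothetical embeddings from three modality-specific streams.
joint = fuse_embeddings([[1, 2], [3], [4, 5]])
```

Fusing compact embeddings rather than full feature maps keeps the joint layer small, which is the parameter saving described above.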


Fig 1 Multi-level feature abstraction from a modality-dedicated network [43] (layers shown: Conv4, Conv5, Pool-4) 


The function f in Figure 1 transforms the 
features. The modality-specific 
embedding layers thus reflect the modality of 
each network. The function f maps a feature map 
to a one-dimensional feature vector. The 
classification procedure then takes into 
account a combination of vectors from various 
levels of abstraction. 


When the feature maps of distinct modalities 
have the same spatial dimensions, fusion can be 
conducted on the feature maps of the CNNs. Before 
the fusion, each modality is represented by the 
output of its final fully-connected layer or by a 
set of CNN layers that represent abstract levels. 
These fully-connected layers are also referred 
to as modality-dedicated embedding layers. The 
authors used the BIOMDATA database [40]. This 
dataset contains a variety of noisy images, 
including blur, sensor noise, and shadows [41]. 
This database has a significant privacy problem. 
In this study, multi-abstract networks were used 
to solve the spatial mismatch problem. As a 
result, this experiment shows that multimodal 
person identification performance has 
improved. In the case of recognition, the 
fully-connected representation yields 
encouraging results. 


3.6 Method 6: Deep CNN 


Schuckers et al. [42] made two 
contributions. First, they looked at how a very 
large dataset may be built using an automation 
loop. Second, they examined the complexity of 
deep network training and face recognition in 
order to present methods and procedures that 
achieve results comparable to the state of the 
art on the standard Labeled Faces in 
the Wild (LFW) and YouTube Faces (YTF) 
benchmarks. They use 
this dataset to train and analyze 
alternative CNN architectures for face 
identification and verification, including face 
alignment and metric learning. The end result is 
simpler but still effective. They discuss two 
families of methods: shallow and deep. 
Shallow techniques begin by obtaining a 
representation of the face image using 
handcrafted local image descriptors such as 
SIFT, LBP, and HOG, and then aggregate these 
local descriptors with a pooling method. The 
deep method employs a CNN that was trained on a 
dataset of 4 million examples spanning 4000 
distinct identities to classify faces. It also 
employs a siamese network design, in which the 
same CNN generates descriptors for 
pairs of faces, which are then compared using 
the Euclidean distance. They also 
developed a bootstrapping method for selecting 
identities to train the network. The 
steps of the experimental methodology are stated 
below. 


• Step 1: Create a list of potential identity names. 
• Step 2: Sort through the list of potential identity names. 
• Step 3: Automatically screen and collect more photos for each identity. 
• Step 4: Remove any near duplicates. 
• Step 5: Complete manual filtering. 
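

As a rough illustration of Step 4, near duplicates can be removed by comparing image descriptors with the Euclidean distance, the same distance used to compare siamese descriptors. The greedy strategy, the threshold of 0.1 and the function names below are assumptions for illustration and not the procedure of [42].

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def remove_near_duplicates(descriptors, threshold=0.1):
    """Greedy filter: return the indices of descriptors that lie farther than
    `threshold` from every descriptor already kept."""
    kept = []
    for i, desc in enumerate(descriptors):
        if all(euclidean_distance(desc, descriptors[j]) > threshold for j in kept):
            kept.append(i)
    return kept

# Hypothetical 2-D descriptors: items 1 and 3 nearly repeat items 0 and 2.
kept_indices = remove_near_duplicates([[0, 0], [0.05, 0], [1, 1], [1, 1.01]])
```

The greedy pass is quadratic in the number of images per identity, which is acceptable for a sketch but would call for clustering or indexing at dataset scale.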


Therefore, with the help of manual 
annotation, the authors devised a procedure for 
assembling data with a low level of label noise. 
This method was created 
for faces, but it can also be used for other 
objects. 


3.7 Method 7: RGB-OCLBCP 


The RGB-Orthogonal Combination-Local 
Binary Coded Pattern (RGB-OCLBCP) method is a 
dual-stream CNN that takes RGB images 
and a color-based texture descriptor as inputs. 
The original study used the Ethnic-ocular 
periocular database. In our paper we used the 
UBIPr, CASIA-Iris, and 
AR datasets to implement this framework 
[43]; the technique is 
illustrated in Figure 2. 


The OCLBCP descriptor employs the color 
information in the periocular texture to better 
represent periocular features for recognition. 
The parameters are shared between the two 
networks, and late fusion is performed at the 
last layer. The combination of the RGB image 
and the novel texture descriptor, as well as the 
complexity of the CNN and the input 
characteristics, 
were explored and analyzed in this research. 
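

To give a feel for the kind of texture code on which such descriptors build, the sketch below computes the basic 8-neighbour Local Binary Pattern (LBP) for a single-channel image. It is deliberately the plain LBP and not the OCLBCP descriptor itself; the bit ordering, the use of nested lists and the function names are assumptions. A color-based variant could, for example, apply the same operator to each color channel.

```python
# Neighbour offsets (row, column), clockwise from the top-left pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(image, r, c):
    """Basic LBP code of the interior pixel (r, c).

    Bit i is set when the i-th neighbour is at least as bright as the centre.
    """
    center = image[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        if image[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code

def lbp_image(image):
    """LBP codes for all interior pixels (the one-pixel border is skipped)."""
    rows, cols = len(image), len(image[0])
    return [[lbp_code(image, r, c) for c in range(1, cols - 1)]
            for r in range(1, rows - 1)]

# A hypothetical 3x3 grey-level patch with centre value 6.
patch = [[5, 9, 1],
         [4, 6, 7],
         [2, 6, 8]]
codes = lbp_image(patch)
```

Histograms of such codes over image regions form the texture representation; the codes depend only on local intensity ordering, which makes them fairly robust to monotonic illumination changes.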
Fig 2 Illustration of Orthogonal Combination-Local Binary Coded Pattern (OCLBCP) 


The Ethnic-ocular database is a large-scale 
collection of in-the-wild periocular images 
from a variety of ethnic groups. This database 
was designed for periocular recognition and 
contains left and right oculars extracted from 
85,394 photos retrieved from the web. The 
protocol splits the data in a 70:15:15 ratio 
for training, testing, and validation. For the 
recognition task, images of a specific set of 
individuals to be recognized were gathered as 
the gallery; when a new (probe) image was 
presented, the task was to determine which of 
the gallery identities it represented. 
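

A 70:15:15 partition of this kind can be sketched as a shuffled split. The function name, the fixed seed and the use of image identifiers as items are assumptions; whether the original protocol also keeps subjects disjoint across partitions is not stated, and this sketch does not enforce it.

```python
import random

def split_dataset(items, ratios=(0.70, 0.15, 0.15), seed=0):
    """Shuffle `items` and split them into training, testing and validation
    subsets according to `ratios`; the last subset takes the remainder."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(round(ratios[0] * len(items)))
    n_test = int(round(ratios[1] * len(items)))
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    val = items[n_train + n_test:]
    return train, test, val

# Example with 100 hypothetical image identifiers.
train, test, val = split_dataset(range(100))
```

Fixing the seed makes a split reproducible, and repeating the call with different seeds gives repeated random selections.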


This selection procedure was 
carried out three times in total. We investigated 
the space and time complexity of a single 
network while testing this model. The OCLBCP 
descriptor has the potential to improve 
recognition performance. When compared to 
various rival networks on these databases, this 
network model outperformed them in both 
recognition and verification tests. 


4. Performance Evaluation on Recognition 


4.1 Evaluation on UBIPr Dataset 


Table 1: UBIPr dataset 


Method      Rank-1 (%)   Rank-5 (%)   EER (%) 
Method 1    84.88        96.01        7.11 
Method 2    90.3         97.41        5.07 
Method 3    90.24        97.36        5.46 
Method 4    90.28        97.18        6.34 
Method 5    90.75        97.44        4.09 
Method 6    90.24        97.09        4.38 
Method 7    91.28        98.59        3.41 


Fig 5: EER comparison for UBIPr dataset 

Fig 6: ROC curve for UBIPr dataset 


4.2 Evaluation on CASIA-Iris Dataset 


Table 2: CASIA-Iris dataset 


Method      Rank-1 (%)   Rank-5 (%)   EER (%) 
Method 1    95           96.98        8.06 
Method 2    95.95        98.15        7.51 
Method 3    96.09        98.1         6.1 
Method 4    96.01        97.85        6.34 
Method 5    95.81        97.67        8.69 
Method 6    95.88        97.99        7.42 
Method 7    96.62        98.45        4.35 


4.3 Evaluation on AR Dataset 


Table 3: AR dataset 


Method      Rank-1 (%)   Rank-5 (%)   EER (%) 
Method 1    93.59        96.75        14.53 
Method 2    95.24        98.38        7.69 
Method 3    94.19        97.75        9.4 
Method 4    94.27        97.52        9.39 
Method 5    96.07        98.71        7.69 
Method 6    94.2         97.61        7.69 
Method 7    96.32        98.8         5.13 


Fig 11: Rank-1 recognition rate comparison for AR dataset 

Fig 12: Rank-5 recognition rate comparison for AR dataset 

Fig 13: EER comparison for AR dataset 


Fig 14: ROC curve for AR dataset (axes: False Positive Rate, True Positive Rate; legend: Method) 


The methods evaluated in this work are 
ReLU non-linearity, DeepIrisNet, FaceNet, Light 
CNN, Multimodal CNN, Deep CNN and RGB-OCLBCP, 
referred to as Method 1 to Method 7 
respectively, and their results are tabulated in 
Table 1, Table 2 and Table 3. The quantitative 
metrics Rank-1, Rank-5, EER and ROC are 
evaluated for the existing methods in order to 
compare them and identify the best technique. 
The UBIPr dataset results for the above 
mentioned techniques are tabulated in Table 1 
and the respective statistical measures are given 
in Figures 3-6. Likewise, the CASIA-Iris dataset 
results for the above mentioned techniques are 
tabulated in Table 2 and the respective 
statistical measures are given in Figures 7-10. 
The AR dataset results for the above mentioned 
techniques are tabulated in Table 3 and the 
respective statistical measures are given in the 
AR dataset figures. 
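

For readers unfamiliar with these metrics, the following minimal sketch shows one common way to compute a rank-k recognition rate from a probe-versus-gallery similarity matrix and to approximate the EER from genuine and impostor scores. The function names, the convention that higher scores mean greater similarity and the threshold scan are assumptions; they do not reproduce the exact evaluation code behind Tables 1 to 3.

```python
def rank_k_rate(score_rows, true_ids, gallery_ids, k):
    """Fraction of probes whose true identity is among the k best-scoring
    gallery identities. score_rows[i][j] is the similarity between probe i
    and gallery identity gallery_ids[j]."""
    hits = 0
    for scores, true_id in zip(score_rows, true_ids):
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if true_id in [gallery_ids[j] for j in ranked[:k]]:
            hits += 1
    return hits / len(true_ids)

def equal_error_rate(genuine, impostor):
    """Approximate EER: scan the observed scores as thresholds and return the
    error at the point where false accept and false reject rates are closest."""
    best_gap, eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejects
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Toy, hypothetical scores for three probes against gallery identities a, b, c.
scores = [[0.9, 0.2, 0.1], [0.3, 0.8, 0.4], [0.4, 0.7, 0.5]]
rank1 = rank_k_rate(scores, ["a", "b", "c"], ["a", "b", "c"], k=1)
rank2 = rank_k_rate(scores, ["a", "b", "c"], ["a", "b", "c"], k=2)
eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.5])
```

A higher rank-k rate and a lower EER both indicate better performance, which is the sense in which the methods are compared in the tables above.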


Our experimental analysis and findings 
revealed that the RGB-OCLBCP descriptor 
provides suitable features as inputs for a 
more robust periocular recognition system. 
Additionally, the network utilizes color-based 
texture information, resulting in more robust 
feature representations for recognition 
and verification in the wild. Periocular 
recognition and verification in the wild face 
greater challenges than in constrained settings. 
As the experimental data from the seven 
approaches tested show, RGB-OCLBCP networks 
recognize images better because of their ability 
to learn new features. Based on the 
efficiency of the network's fusion layers, we 
believe that multi-feature learning can do 
significantly better than RGB images alone in 
periocular recognition. 


CONCLUSION 


This study evaluates the performance of 
seven distinct periocular recognition 
techniques: ReLU non-linearity, DeepIrisNet, 
FaceNet, Light CNN, Multimodal CNN, Deep 
CNN and RGB-OCLBCP. These techniques are 
applied to images from three different datasets, 
namely UBIPr, CASIA-Iris and AR. The 
comparison of these approaches reveals that, 
depending on the real-world application for 
which a system is designed, each algorithm 
exhibits its own benefits and drawbacks. 
From this in-depth evaluation of their 
performance, we found that the RGB-OCLBCP 
technique gives more promising outcomes 
than the other techniques. 